
[Core] add client side health-check to detect network failures. #31640

Merged 10 commits into ray-project:master on Jan 13, 2023

Conversation

scv119
Contributor

@scv119 scv119 commented Jan 12, 2023

Why are these changes needed?

Occasionally Ray users have seen ray.get hang when the node executing the task that ray.get is waiting for is preempted and disconnected from the cluster.

While debugging one instance of this hang, we found it was caused by the underlying gRPC channel failing to detect the network failure.

To solve this problem, we need to add some sort of health check at the OS level (TCP keepalive), the RPC level (gRPC), or the application level (Ray). TCP keepalive is not easy to configure through gRPC, and a Ray-level check would require changing a lot of code, so this PR makes the change at the gRPC level.

Also note that in Ray we treat a network failure as a component failure, so we set a relatively loose timeout to reduce false positives.
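For illustration, client-side gRPC keepalive is configured through channel arguments. A minimal Python sketch follows; the option names are standard gRPC channel arguments, but the numeric values are illustrative examples, not the defaults this PR chooses:

```python
# Hedged sketch: enable client-side HTTP/2 keepalive pings so a dead peer
# is detected even when the kernel never receives a FIN/RST. Values are
# examples only, not the timeouts chosen in this PR.

def keepalive_channel_options(time_ms=60_000, timeout_ms=20_000):
    """Return gRPC channel options enabling client-side keepalive pings."""
    return [
        # Ping the server after `time_ms` ms of inactivity.
        ("grpc.keepalive_time_ms", time_ms),
        # Declare the connection dead if a ping goes unacknowledged
        # for `timeout_ms` ms.
        ("grpc.keepalive_timeout_ms", timeout_ms),
        # Send pings even when no RPC is in flight.
        ("grpc.keepalive_permit_without_calls", 1),
    ]

# Usage (requires grpcio; the address is hypothetical):
# channel = grpc.insecure_channel("node:10001",
#                                 options=keepalive_channel_options())
```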

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 marked this pull request as ready for review January 12, 2023 19:20
@cadedaniel
Member

Should we also do this here:

channel_ = BuildChannel(address, port, arguments);

or, if not, why not?

Contributor

@rickyyx rickyyx left a comment


Is there a way to test this?

And what about the Python client's configs?

src/ray/common/ray_config_def.h
@scv119
Contributor Author

scv119 commented Jan 12, 2023

cc @shomilj

Contributor

@rkooo567 rkooo567 left a comment


Requesting changes until Ricky's comments are addressed!

src/ray/common/ray_config_def.h
Contributor

@rkooo567 rkooo567 left a comment


We may need to change the config from the Python side too. I am not sure if we have any gRPC client other than https://github.com/ray-project/ray/blob/master/python/ray/_private/gcs_pubsub.py. Maybe we should aggregate all gRPC client usage into a single file so the global config applies to Python as well?

src/ray/common/ray_config_def.h
@rkooo567 rkooo567 added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on Jan 13, 2023
src/ray/common/grpc_util.h (outdated)
@scv119 scv119 removed the @author-action-required label on Jan 13, 2023
@rkooo567 rkooo567 added the @author-action-required label on Jan 13, 2023
@scv119
Contributor Author

scv119 commented Jan 13, 2023

Hmm, the challenging part is simulating the failure mode where the node is terminated without sending a FIN on the TCP connection.

@scv119 scv119 merged commit 0ca11dc into ray-project:master Jan 13, 2023
@scv119
Contributor Author

scv119 commented Jan 13, 2023

I tried both rebooting and preempting spot instances while a job was running; in both cases Ray was able to detect that the node failed.
However, we are not 100% sure we reproduced the exact problem our customer encountered.

@scv119
Contributor Author

scv119 commented Jan 13, 2023

#24969
